Search CORE

14 research outputs found

Towards a New Monolingual Hungarian Explanatory Dictionary: An Overview of Hungarian Explanatory Dictionaries

Author: Lipp Veronika
Simon László
Publication venue: Leksikografski zavod Miroslav Krleža
Publication date: 01/01/2021

The Lexical Knowledge Representation Research Group at the Department of Lexicology, founded in February 2020, is one of the youngest research groups of the Hungarian Research Centre for Linguistics. The group is currently working on a new version of a monolingual explanatory dictionary partly based on The Explanatory Dictionary of the Hungarian Language. The aim is to compile an up-to-date online dictionary of contemporary Hungarian (2001–2020) using corpus-driven methods. The present article describes The Explanatory Dictionary of the Hungarian Language and the Comprehensive Dictionary of Hungarian by presenting their history, the circumstances of their compilation, and their basic editorial guidelines. It then outlines how the corpus for the planned dictionary is to be set up and analysed.

HRČAK - Portal of Croatian Scientific and Professional Journals


Comprehensive Dictionary of Hungarian

Author: Lipp Veronika
Publication venue: Wydawnictwa Uniwersytetu Warszawskiego
Publication date: 01/01/2018

Repository of the Academy's Library

Corpus Cleaning and Handling End-of-Line Hyphens with a Character-Based Neural Language Model

Author: Lipp Veronika
Pethő Gergely
Sass Bálint
Simon László
Publication date: 01/01/2023

The aim of our paper is twofold. First, we present a simple and general method by which character-based language models can be used for, among other things, cleaning corpora. Second, we describe a specific language model trained on a clean corpus of Hungarian press language, with which we achieved good results by applying this method. In addition, we make publicly available the Python application for configuring and (re)training character-level or word-level recurrent neural language models that we used to train our model; with it, this Hungarian press-language model can be adapted to other kinds of training corpora, or a new model can be trained. The resource requirements of the bidirectional LSTM language model presented here are relatively modest. Following the proposed method, the model can be used directly, that is, without any further training for the given subtask, for a wide range of tasks that arise in corpus cleaning, for example identifying foreign-language passages or passages containing too much noise, and correcting sporadic OCR errors and missing diacritics. We evaluated the language model on the task of disambiguating end-of-line hyphenation: the method's performance on this task exceeded the very high baseline.

University of Szeged

An Unsupervised Approach to Characterize the Adjectival Microstructure in a Hungarian Monolingual Explanatory Dictionary

Author: Héja Enikő
Ligeti-Nagy Noémi
Lipp Veronika
Simon László
Publication venue: Lexical Computing
Publication date: 01/01/2023

Repository of the Academy's Library

Russian-Hungarian Transliteration and Restoration of Hungarian Prisoner-of-War Data, and the Properties of Free-Text Databases

Author: Halász Dávid
Kalivoda Ágnes
Lipp Veronika
Mittelholcz Iván
Sass Bálint
Publication date: 01/01/2021

In this study we present the method and lessons of the Russian-to-Hungarian transliteration of proper names in the database of Hungarian prisoners of war. In the database, the records of 682,000 prisoners of war are available in Cyrillic script. The data were distorted in two rounds: first when the Soviet soldier recording the data wrote them down by ear, and again 60 years later when native speakers of Russian manually digitized the material from the handwritten cards. In our case the task is therefore not simple transliteration but in fact the restoration of the original Hungarian word. Splitting the data describing places into data fields was a separate task. Our rule-based algorithm applies strict and loose transliteration as well as approximate search, and checks the transliteration against lists. If none of these methods yields a result, we return the naive letter-by-letter transliteration. Result: we were able to assign a correct restored form to 77% of the data. We draw lessons about the inevitable inconsistency of hand-made, unrestricted free-text databases, and about the fact that for one-of-a-kind data, in the absence of training data, rule-based methods are justified.
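
The cascade described in the abstract above (strict transliteration, loose transliteration, approximate search against lists, then a letter-by-letter fallback) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the tiny letter table `CYR2HUN`, the reference list `NAMES`, the helper names, the interpretation of "loose" as diacritic-insensitive matching, and the similarity cutoff of 0.8.

```python
import unicodedata
from difflib import get_close_matches
from typing import Tuple

# Hypothetical, deliberately tiny Cyrillic-to-Hungarian letter table (assumption).
CYR2HUN = {"а": "a", "б": "b", "в": "v", "г": "g", "и": "i", "к": "k",
           "л": "l", "н": "n", "о": "o", "с": "sz", "т": "t", "ч": "cs"}

# Hypothetical reference list of Hungarian surnames (assumption).
NAMES = ["Balog", "Kovács", "Szabó", "Tóth"]


def fold(text: str) -> str:
    """Lowercase the text and strip diacritics, so 'Kovács' becomes 'kovacs'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


# Map each diacritic-free form back to the properly spelled name.
FOLDED = {fold(name): name for name in NAMES}


def naive(word: str) -> str:
    """Letter-by-letter transliteration; unknown letters are passed through."""
    return "".join(CYR2HUN.get(ch, ch) for ch in word.lower()).capitalize()


def restore(word: str) -> Tuple[str, str]:
    """Return (restored form, name of the step that produced it)."""
    candidate = naive(word)
    if candidate in NAMES:                      # strict: exact hit in the list
        return candidate, "strict"
    key = fold(candidate)
    if key in FOLDED:                           # loose: ignore missing diacritics
        return FOLDED[key], "loose"
    close = get_close_matches(key, list(FOLDED), n=1, cutoff=0.8)
    if close:                                   # approximate search in the list
        return FOLDED[close[0]], "approximate"
    return candidate, "fallback"                # naive transliteration as last resort


if __name__ == "__main__":
    for cyrillic in ["Балог", "Ковач", "Тот", "Иванов"]:
        print(cyrillic, restore(cyrillic))
```

Ordering the steps from most to least restrictive means that a cheaper, more reliable match always wins, and the naive form is returned only when no list entry is close enough.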

University of Szeged

Russian-Hungarian Transliteration and Restoration of Hungarian Prisoner-of-War Data, and the Properties of Free-Text Databases

Author: Halász Dávid
Kalivoda Ágnes
Lipp Veronika
Mittelholcz Iván
Sass Bálint
Publication venue: Szegedi Tudományegyetem TTIK Informatikai Intézet
Publication date: 01/01/2021

Repository of the Academy's Library

Preverb Linking

Author: Kalivoda Ágnes
Lipp Veronika
Pethő Gergely
Sass Bálint
Simon László
Publication date: 01/01/2022

As this very sentence shows in the Hungarian original ("el tud válni", where the preverb "el" is separated from the verb "válni"), in Hungarian the preverb can be detached from its verb. The basic unit of corpora is the word, the token; for this reason a detached preverb has traditionally always appeared as a separate token in Hungarian corpora. These corpora therefore do not contain the basic information of which verb a given preverb belongs to. In the present study we provide (1) a method for linking preverbs to their base verbs, and (2) a scheme for representing this in the corpus annotation in the way that best supports searching the corpus. We implemented the tool as a module of the e-magyar system; it provides functionality that is needed in many research projects. The resulting emPreverb module and the manually annotated development and test corpora are freely available.

University of Szeged

Designing the ELEXIS Parallel Sense-Annotated Dataset in 10 European Languages

Author: Federico Martelli
Győrffy András
Jelena Kallas
Lipp Veronika
Polona Gantar
Roberto Navigli
Simon Krek
Simon László
Váradi Tamás
Publication venue: Lexical Computing
Publication date: 01/01/2021

Repository of the Academy's Library

Tamás Váradi and the International Lexicography

Author: Lipp Veronika
Publication venue: Nyelvtudományi Kutatóközpont
Publication date: 01/01/2021

Repository of the Academy's Library

Parallel sense-annotated corpus ELEXIS-WSD 1.0

Author: Costa Rute
Dobrovoljc Kaja
Frontini Francesca
Gantar Polona
Győrffy András
Kallas Jelena
Koeva Svetla
Koppel Kristina
Krek Simon
Langemets Margit
Lipp Veronika
László Simon
Martelli Federico
Monachini Monica
Munda Tina
Navigli Roberto
Nimb Sanni
Olsen Sussi
Quochi Valeria
Salgado Ana
Sancho-Sánchez José-Luis
Sandford Pedersen Bolette
Tempelaars Rob
Tiberius Carole
Ureña-Ruiz Rafael
Váradi Tamás
Üksik Tiiu
Čibej Jaka
Publication venue: Jožef Stefan Institute
Publication date: 28/07/2022

ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.0 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene.

The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words and fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language.

The sentences were tokenized, lemmatized, and tagged with POS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation.

List of sense inventories
BG: Dictionary of Bulgarian
DA: DanNet – The Danish WordNet
EN: Open English WordNet
ES: Spanish Wiktionary
ET: The EKI Combined Dictionary of Estonian
HU: The Explanatory Dictionary of the Hungarian Language
IT: PSC + Italian WordNet
NL: Open Dutch WordNet
PT: Portuguese Academy Dictionary (DACL)
SL: Digital Dictionary Database of Slovene

The corpus is available in a CONLL-like tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, its whitespace information (whether the token is followed by a whitespace or not), the ID of the sense assigned to the token, and the index of the multiword expression (if the token is part of an annotated multiword expression).

Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between. For more information, please refer to 00README.txt
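
As a rough illustration of the seven-column tab-separated layout described above, the following Python sketch reads such data into sentences. Several details are assumptions rather than facts from the corpus description: that sentences are separated by blank lines, that lines starting with `#` are comments, that `_` marks an empty value, and the sample rows themselves (including the whitespace flags and the sense ID `sense-001`) are invented. The corpus's 00README.txt is the authority on the actual conventions.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

# Column order as given in the corpus description.
COLUMNS = 7


@dataclass
class Token:
    token_id: str
    form: str
    lemma: str
    upos: str
    whitespace: str   # whether the token is followed by whitespace
    sense_id: str     # ID of the assigned sense ("_" assumed to mean none)
    mwe_index: str    # index of the multiword expression, if any


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence (blank-line separated, by assumption)."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():            # blank line closes the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):        # assumed comment or metadata line
            continue
        fields = line.split("\t")
        if len(fields) != COLUMNS:
            raise ValueError(f"expected {COLUMNS} columns, got {len(fields)}: {line!r}")
        sentence.append(Token(*fields))
    if sentence:                        # last sentence without trailing blank line
        yield sentence


# Invented sample rows, for illustration only.
SAMPLE = (
    "# hypothetical sentence\n"
    "1\tThe\tthe\tDET\tY\t_\t_\n"
    "2\tbank\tbank\tNOUN\tN\tsense-001\t_\n"
    "\n"
)

if __name__ == "__main__":
    for tokens in read_sentences(SAMPLE.splitlines()):
        annotated = [(t.form, t.sense_id) for t in tokens if t.sense_id != "_"]
        print(annotated)
```

Keeping every field as a string avoids guessing at value types that the description does not specify; callers can convert the token ID or multiword index once the real conventions are known.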

Common Language Resources and Technology Infrastructure - Slovenia